Focus of this tutorial

- Phenomena 

- Concepts 

- Approaches & Resources 

What is ‘Arabic’? 

- Arabic Script 

- Arabic Language 

• Modern Standard Arabic (MSA)

• Arabic Dialects 



Road Map 


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 
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Road Map 


Introduction 

Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

Morphology 

Syntax 

Machine Translation Issues 
Dialects 


Arabic Script


Arabic script is an alphabet with allographic variants,
optional zero-width diacritics, and common ligatures.




Arabic script is used to write many languages: Arabic, 
Persian, Kurdish, Urdu, Pashto, etc. 
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Arabic Script 

Alphabet

• letter forms

[Glyph examples: bare letter forms]

• letter marks

• Arabic only

• Other languages

• Persian, Kurdish, Urdu, Pashto, etc.

[Glyph examples: marks used in Arabic and additional marks used in other languages]

OCR output ambiguity
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Arabic Script 


Alphabet (MSA)

• letters (form+mark)

• Distinctive

ث /θ/   ت /t/   ب /b/

• Non-distinctive

ء /ʔ/

glottal stop aka hamza



Arabic Script 


Letter Shapes 

• No distinction between print and handwriting 

• No capitalization 

• Right-to-left 

• Ambiguous shapes

• Connective letters

• Disconnective letters
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Arabic Script 


Letter shaping 

كتب = ك (k) + ت (t) + ب (b)

/katab/

to write

كتاب = ك (k) + ت (t) + ا (a) + ب (b)

/kitab/

book
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Arabic Script 

Diacritics 

• Zero-width characters

• Used for short vowels

كَتَب /katab/ to write

• Nunation marks nominal indefiniteness in MSA

كِتابٌ /kitabun/ a book





Arabic Script 


Diacritics 

• No-vowel marker (sukun)

مَكْتَب /maktab/ office

• Double consonant marker (shadda)

كَتَّب /kattab/ to dictate

• Combinable: shadda can be combined with a vowel or nunation mark

/bbu/ /bbin/ /bban/
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Arabic Script 


Putting it together 

Simple combination 

Arab /ʕarab/

عَرَب

West /ɣarb/

غَرْب

Ligatures

Peace /salam/

سلام (lam followed by alef is written as the ligature لا)



Arabic Script 


Tatweel

• 'elongation'

• aka kashida

• used for text highlighting and justification

حقوق الإنسان / حقـــوق الإنســـان

human rights /huquq alʔinsan/
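Because tatweel only stretches a word visually, text processing usually removes it before comparing or counting words. A minimal sketch, assuming the text is Unicode and that tatweel is the single code point U+0640; the function name is illustrative only:

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def strip_tatweel(text: str) -> str:
    """Remove every tatweel character and keep everything else unchanged."""
    return text.replace(TATWEEL, "")


# The word "huquq" written with two elongation characters in the middle.
stretched = "\u062D\u0642\u0640\u0640\u0648\u0642"
plain = "\u062D\u0642\u0648\u0642"
print(strip_tatweel(stretched) == plain)
```

After this step the stretched and unstretched spellings compare equal.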
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Arabic Script 


• Different styles

• High fluidity

• Optional ligatures

• Vertical arrangements


/ʕarabi/ Arabic

/muħammad/ Muhammad

/aldʒabr/ algebra

[Images: the three words above written in different calligraphic styles]
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Arabic Script 

"Arabic" Numerals 

• Decimal system 

• Numbers written left-to-right in right-to-left text 

[Arabic example sentence containing the numbers 132 and 1962]

Algeria achieved its independence in 1962 after 132 years of French occupation.

• Three systems of enumeration symbols that vary by region

- Western Arabic (0 1 2 3 4 5 6 7 8 9): Tunisia, Morocco, etc.

- Indo-Arabic (٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩): Middle East

- Eastern Indo-Arabic (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹): Iran, Pakistan, etc.
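Since the three digit systems encode the same decimal values, a preprocessing step can map them onto one set of symbols. The sketch below assumes Unicode input, with the Indo-Arabic digits at U+0660 to U+0669 and the Eastern Indo-Arabic digits at U+06F0 to U+06F9; the function name is hypothetical.

```python
WESTERN = "0123456789"
INDO_ARABIC = "".join(chr(0x0660 + i) for i in range(10))
EASTERN_INDO_ARABIC = "".join(chr(0x06F0 + i) for i in range(10))

# One translation table that sends both regional digit sets to 0-9.
_TO_WESTERN = str.maketrans(INDO_ARABIC + EASTERN_INDO_ARABIC, WESTERN * 2)


def normalize_digits(text: str) -> str:
    """Rewrite Indo-Arabic and Eastern Indo-Arabic digits as Western digits."""
    return text.translate(_TO_WESTERN)


print(normalize_digits("\u0661\u0669\u0666\u0662"))  # the year 1962 in Indo-Arabic digits
```

The digit order inside a number is unchanged, which matches the left-to-right writing of numbers noted above.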
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Road Map 


• Introduction 

• Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


MSA Phonology and Spelling 

• Phonological profile of Standard Arabic 

- 28 Consonants 

- 3 short vowels, 3 long vowels, 2 diphthongs 

• Arabic spelling is mostly phonemic ...

- Letter-sound correspondence

[Chart: Arabic letters paired with their phonemes]



MSA Phonology and Spelling 


• Arabic spelling is mostly phonemic ... 

Except for 

• Medial short vowels can only appear as diacritics

• Diacritics are optional in most written text

- Except in holy scripture

- Diacritics, when present, mark syntactic/semantic distinctions

• /katab/ to write   /kutib/ to be written

• /hubb/ love   /habb/ seed

• Dual use of ا, و, ي as consonant and long vowel

- ا (/ʔ/, /ā/)   و (/w/, /ū/)   ي (/j/, /ī/)
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MSA Phonology and Spelling 


• Arabic spelling is mostly phonemic ... 

Except for (continued) 

• Morphophonemic characters 

- Feminine marker ة (ta marbuta)

• كبير /kabīr/ (big ♂)   كبيرة /kabīra/ (big ♀)

- Derivation marker

• /ʕaṣa/: عصى (to disobey), عصا (a stick)

• Hamza variants (6 characters for one phoneme!)

- (ء أ إ آ ؤ ئ) بهاء /baha'/ + 3MascSing (his glory)


MSA Phonology and Spelling 


• Arabic spelling can be ambiguous 

- optional diacritics and dual use of letter 

• But how ambiguous? Really? 

• Classic example 

ths s wht n rbc txt lks lk wth n vwls

this is what an Arabic text looks like with no vowels

• Not exactly true

- Long vowels are always written

- Initial vowels are represented by an ا ('alef)

- Some final short vowels are represented

ths is wht an Arbc txt lks lik wth no vwls

We will revisit ambiguity in more detail under the morphology discussion.
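The effect of optional diacritics can be made concrete by stripping them and grouping the words that collapse onto the same written form. This is a sketch only: it assumes Unicode text and treats every nonspacing combining mark (category `Mn`) as a diacritic, which is broader than the Arabic short-vowel marks alone. The helper names are illustrative.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Drop all nonspacing combining marks, which include the Arabic diacritics."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def group_by_undiacritized(words):
    """Map each undiacritized spelling to the diacritized words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    return dict(groups)


KATAB = "\u0643\u064E\u062A\u064E\u0628"  # /katab/ to write
KUTIB = "\u0643\u064F\u062A\u0650\u0628"  # /kutib/ to be written
groups = group_by_undiacritized([KATAB, KUTIB])
print(len(groups))
```

Both readings end up under the single three-letter spelling, which is the ambiguity a reader or a disambiguation system has to resolve from context.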
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Road Map 


• Introduction 

• Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Arabic Script 
Other languages 


Arabic 


■ No more than 3 dots

■ Dots either above or below

■ Marks are 1/2/3 dots, hamza (ء) or madda (~) only

■ Rare borrowing for foreign words

• پ /p/, ڤ /v/, گ /g/, چ /tʃ/

• regionally variable


Not Arabic


■ Extra marks: haček (ˇ), ring (°), small taa (ط), four dots, vertical dots (:)

■ Some numerals (۴, ۵, ۶)



Once you learn the alphabet, it is easier :)
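The checklist above suggests a rough automatic test: measure how many Arabic-script letters fall outside the core MSA letter ranges. The sketch below is a heuristic under stated assumptions, not a language identifier. It assumes Unicode text, takes U+0621 to U+063A and U+0641 to U+064A as the core letters, and ignores tatweel; the function names and any threshold applied to the result are hypothetical.

```python
TATWEEL = "\u0640"


def is_arabic_script_letter(ch: str) -> bool:
    """True for letters in the main Arabic Unicode block, excluding tatweel."""
    return "\u0600" <= ch <= "\u06FF" and ch.isalpha() and ch != TATWEEL


def is_core_arabic_letter(ch: str) -> bool:
    """True for the letters used to write MSA."""
    return "\u0621" <= ch <= "\u063A" or "\u0641" <= ch <= "\u064A"


def extended_letter_ratio(text: str) -> float:
    """Share of Arabic-script letters that are not core MSA letters."""
    letters = [ch for ch in text if is_arabic_script_letter(ch)]
    if not letters:
        return 0.0
    extended = [ch for ch in letters if not is_core_arabic_letter(ch)]
    return len(extended) / len(letters)


print(extended_letter_ratio("\u0643\u062A\u0627\u0628"))  # an MSA word
print(extended_letter_ratio("\u06AF\u0686"))              # gaf + tcheh
```

A ratio near zero is consistent with Arabic, while a clearly higher ratio points to Persian, Urdu, Pashto, Kurdish, or another language. Because Arabic occasionally borrows these letters for foreign words, a small nonzero ratio is not conclusive.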
[Slide images: sample texts written in Arabic script, each followed by the question below]

□ Arabic

□ Not Arabic

Road Map 


• Introduction 

• Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Encoding Issues 


Encoding Arabic 

- Data entry, storage, and display 

- Ease of use for Arabic-illiterate users 

- Multi-script support 

- Multilingual support (extended Arabic characters) 

Types of Encoding 

- Machine character sets 

• Graphemic (shape insensitive, logical order) 

• Allographs (shape/direction sensitive) [obsolete] 

- Human accessible 

• Transliteration 

• Phonetic spelling (IPA) 

• Romanization 



Encoding Issues 

• Many Conflicting Character Sets for Arabic 



Encodings 

• CP-1256

- Commonly used

- 1-byte characters

- Widely supported input/display

- Minimal support for extended Arabic characters

- Bi-script support (Roman/Arabic)

- Tri-lingual support: Arabic, French, English (à la ANSI)


Codepage 1256 - Arabic Windows
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Encodings 

• Unicode

- Increasingly becoming the standard

- 2-byte characters (Arabic letters take two bytes in both UTF-16 and UTF-8)

- Widely supported input/display

- Supports extended Arabic characters

- Multi-script representation
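The byte sizes and the effect of reading bytes with the wrong character set can be checked directly. The sketch below uses Python's standard codecs `cp1256`, `utf-8`, `utf-16-le`, and `latin-1`; the sample word and variable names are illustrative.

```python
word = "\u0633\u0644\u0627\u0645"  # "salam", four letters

cp1256_bytes = word.encode("cp1256")    # one byte per letter
utf16_bytes = word.encode("utf-16-le")  # two bytes per letter
utf8_bytes = word.encode("utf-8")       # two bytes per Arabic letter

print(len(cp1256_bytes), len(utf16_bytes), len(utf8_bytes))

# Reading CP-1256 bytes as a Western (Latin-1) character set yields
# accented Latin letters instead of Arabic.
garbled = cp1256_bytes.decode("latin-1")

# The bytes are intact, so decoding with the correct set recovers the text.
restored = garbled.encode("latin-1").decode("cp1256")
print(restored == word)
```

The garbled string is a display problem only; nothing is lost as long as the original bytes are kept and the actual encoding is known.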
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Encodings 


• Unicode

- Supports presentation forms (shapes and ligatures)

[Code chart: Arabic Presentation Forms-B, FE70 to FEFF]

[Code chart: Arabic Presentation Forms-A, FC40 to FD1F]
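Text that was stored with presentation-form code points can be mapped back to the ordinary shape-insensitive letters with Unicode compatibility normalization. A minimal sketch using the standard `unicodedata` module; the function name is illustrative, and NFKC also rewrites other compatibility characters, which may or may not be wanted in a given pipeline.

```python
import unicodedata


def to_logical_letters(text: str) -> str:
    """Replace presentation forms and ligatures with their base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


BEH_ISOLATED = "\uFE8F"       # ARABIC LETTER BEH ISOLATED FORM
LAM_ALEF_ISOLATED = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

print(to_logical_letters(BEH_ISOLATED) == "\u0628")
print(to_logical_letters(LAM_ALEF_ISOLATED) == "\u0644\u0627")
```

The ligature expands to two letters, so string lengths can change after normalization.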
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Encoding Issues 

Arabic Display 

• Memory (logical order) ->

[Example: an Arabic sentence containing "(Palestine)", "(Olympics)", "2000" and "2004", shown character by character in storage order]

or this way for those with direction-bias

[The same character sequence mirrored, ending in ") enitselaP (" and beginning with ".4002"]
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Encoding Issues 

Arabic Display 

• Memory (logical order)

[Example sentence in storage order, containing "(Palestine)", "(Olympics)", "2000" and "2004"]

• Display (visual order)

- Bidirectional (BiDi) support

• Numbers and Roman script

[The example with numbers and Roman words displayed left-to-right inside the right-to-left sentence]

- Letter and ligature shaping

[The same example with letters joined and ligatures applied]
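BiDi display relies on a direction class that Unicode assigns to every character. The sketch below only inspects those classes with the standard `unicodedata` module; it does not implement the reordering algorithm itself, and the sample string is illustrative.

```python
import unicodedata


def bidi_classes(text: str):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


sample = "\u0641\u064A 2004"  # two Arabic letters, a space, a Western number
print(bidi_classes(sample))

# Indo-Arabic digits have their own class, distinct from Western digits.
print(unicodedata.bidirectional("\u0661"), unicodedata.bidirectional("1"))
```

Arabic letters report `AL`, Latin letters `L`, Western digits `EN`, and Indo-Arabic digits `AN`; a renderer uses these classes to keep numbers and Roman words left-to-right inside right-to-left text.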
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Display Problems 



[Table: the same Arabic text stored in each actual encoding (CP-1256, ISO-8859, Unicode) and shown under each display encoding (CP-1256, ISO-8859, Unicode, Western); mismatched combinations produce unreadable characters]

• Wrong encoding

• Partial support problems
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Encoding Issues 

Arabic Input 


• Standard graphemic keyboard 

• Logical order input 



[Image: Arabic keyboard layout]

http://www.cyrillic.com/kbd/btc.html




Encodings 

Buckwalter Encoding 

• Romanization

- One-to-one mapping to Arabic script spelling

- Left-to-right

- Easy to learn/use

- Human & machine compatible

• Commonly used in NLP

- Penn Arabic Treebank

• Some characters can be modified to allow use with XML and regular expressions

• Roman input/display

• Monolingual encoding (cannot mix English and Arabic)

• Minimal support for extended Arabic characters


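Buckwalter character = Arabic character

'  = ء      *  = ذ      l  = ل
|  = آ      r  = ر      m  = م
>  = أ      z  = ز      n  = ن
&  = ؤ      s  = س      h  = ه
<  = إ      $  = ش      w  = و
}  = ئ      S  = ص      Y  = ى
A  = ا      D  = ض      y  = ي
b  = ب      T  = ط      F  = ً (fathatan)
p  = ة      Z  = ظ      N  = ٌ (dammatan)
t  = ت      E  = ع      K  = ٍ (kasratan)
v  = ث      g  = غ      a  = َ (fatha)
j  = ج      _  = ـ      u  = ُ (damma)
H  = ح      f  = ف      i  = ِ (kasra)
x  = خ      q  = ق      ~  = ّ (shadda)
d  = د      k  = ك      o  = ْ (sukun)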
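Because the Buckwalter mapping is one-to-one, it can be implemented as a pair of translation tables. The sketch below covers the basic characters only (U+0621 to U+063A and U+0640 to U+0652) and leaves anything else untouched; the function names are illustrative, and the XML-safe variants mentioned above are not handled.

```python
# Arabic code points in Unicode order, paired with their Buckwalter symbols.
_ARABIC = (
    "".join(chr(c) for c in range(0x0621, 0x063B))    # hamza .. ghain
    + "".join(chr(c) for c in range(0x0640, 0x0653))  # tatweel .. sukun
)
_BUCKWALTER = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYyFNKaui~o"
assert len(_ARABIC) == len(_BUCKWALTER)

_TO_BW = str.maketrans(_ARABIC, _BUCKWALTER)
_FROM_BW = str.maketrans(_BUCKWALTER, _ARABIC)


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic script into Buckwalter symbols."""
    return text.translate(_TO_BW)


def from_buckwalter(text: str) -> str:
    """Transliterate Buckwalter symbols back into Arabic script."""
    return text.translate(_FROM_BW)


kitab = "\u0643\u062A\u0627\u0628"  # "book"
print(to_buckwalter(kitab))
print(from_buckwalter(to_buckwalter(kitab)) == kitab)
```

Note that `from_buckwalter` converts every Latin letter in the table, which is the practical meaning of the encoding being monolingual: English text mixed into the input would be turned into Arabic letters as well.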
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Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Morphology 


Type 

- Concatenative: prefix, suffix, circumfix 

- Templatic: root + pattern

Function 

- Derivational 

• Creating new words 

• Mostly templatic 

- Inflectional 

• Modifying features of words 

- Tense, number, person, mood, aspect 

• Mostly concatenative 



Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Derivational Morphology 


• Templatic Morphology

• Root: ك ت ب (k t b)

• Pattern

• Lexeme

مكتوب maktub "written"

كاتب katib "writer"

Lexeme.Meaning =

(Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random


Derivational Morphology 

Root Meaning 

• ك ت ب KTB = notion of “writing”

كتاب /kitab/ book
كتب /katab/ write
مكتبة /maktaba/ library
مكتوب /maktub/ letter
مكتب /maktab/ office
كاتب /katib/ writer
مكتوب /maktub/ written
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Derivational Morphology 

Root Meaning 

• LHM-1

• Notion of “meat”

- لحم /laħm/

• Meat

- لحّام /laħħam/

• Butcher
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Derivational Morphology 
Root Meaning 


• LHM-2

• Notion of “battle”

- ملحمة /malħama/

• Fierce battle

• Massacre

• Epic




Derivational Morphology 

Root Meaning 


• LHM-3


• Notion of “soldering”

- لحم /laħam/

• Weld, solder, stick, cling

- التحم /iltaħam/

• Be welded/soldered/fused

- ملتحم /multaħim/

• Welded, soldered, fused
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Derivational Morphology 

Pattern Meaning 


• Verb Pattern Meaning is hard to define 


Pattern  Template    Pattern Meaning              Example            Gloss
I        1a2a3       Basic sense of root          ktb -> katab       write
II       1a22a3      Intensification, causation   ktb -> kattab      dictate
III      1aA2a3      Interaction with others      ktb -> kaAtab      correspond with
IV       Aa12a3      Causation                    jls -> Aajlas      seat
V        ta1a22a3    Reflexive of Pattern II      Elm -> taEal~am    learn
VI       ta1aA2a3    Reflexive of Pattern III     ktb -> takaAtab    correspond
VII      Ain1a2a3    Passive of Pattern I         ktb -> Ainkatab    subscribe/enroll
VIII     Ai1ta2a3    Acquiescence, exaggeration   ktb -> Aiktatab    register
IX       Ai12a33     Transformation               Hmr -> AiHmarr     turn red/blush
X        Aista12a3   Requirement                  ktb -> Aistaktab   ask/make_write

(In each template, the digits 1, 2 and 3 stand for the first, second and third root consonants.)
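Root-and-pattern derivation can be sketched as slot filling: each digit in a template is replaced by the corresponding root consonant. The example assumes a three-consonant root written in romanization with one character per consonant; the function name is hypothetical, and real systems also handle weak roots and spelling adjustments that this sketch ignores.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the slots 1, 2, 3 of a template with the consonants of a root."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


for template in ("1a2a3", "1a22a3", "Ai1ta2a3", "Aista12a3"):
    print(template, "->", apply_pattern("ktb", template))
```

The function produces the form only. As the formula for lexeme meaning indicates, the meaning of the result is not fully predictable from root and pattern.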
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Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Inflectional Morphology 

Derivational Morphology 

- Lexeme ≈ Root + Pattern

Inflectional Morphology

- Word = Lexeme + Features

Features

- Part-of-speech

• Traditional: Noun, Verb, Particle

• Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Punc, IJ, and others

- Noun-specific

• Number: singular, dual, plural, collective

• Gender: masculine, feminine, neutral

• Definiteness: definite, indefinite 

• Case: nominative, accusative, genitive 

• Possessive clitic 
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Inflectional Morphology 


Features (continued) 

- Verb-specific 

• Aspect: perfective, imperfective, imperative 

• Voice: active, passive 

• Tense: past, present, future 

• Mood: indicative, subjunctive, jussive 

• Subject (Person, Number, Gender) 

• Object clitic 

- Others 

• Single-letter conjunctions 

• Single-letter prepositions 



Inflectional Morphology 

Nouns

وكبيوتنا /wakabiyutina/

و + ك + بيوت + نا

wa+ka+biyut+na
and+like+houses+our
And like our houses


وللمكتبات /walilmaktabat/

و + ل + ال + مكتبة + ات

wa+li+al+maktaba+at
and+for+the+library+plural
And for the libraries


Morphotactics (e.g. ل + ال -> لل)
Arabic Broken Plurals (templatic)
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Inflectional Morphology 

Verbs

فقلناها /faqulnaha/

ف + قال + نا + ها

fa+qul+na+ha
so+said+we+it
So we said it.


وسنقولها /wasanaquluha/

و + س + ن + قول + ها

wa+sa+na+qul+u+ha
and+will+we+say+it
And we will say it


Morphotactics

Subject conjugation (suffix or circumfix)
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Inflectional Morphology 


• Perfect verb subject conjugation (suffixes only)

Person   Singular   Dual        Plural
1        katabtu                katabna
2        katabta    katabtuma   katabtum
3        kataba     katabā      katabū

• Imperfect verb subject conjugation (prefix+suffix)

Person   Singular   Dual        Plural
1        aktubu                 naktubu
2        taktubu    taktubān    taktubūn
3        yaktubu    yaktubān    yaktubūn
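The difference between the two paradigms (suffix only versus prefix plus suffix) can be sketched with two lookup tables. The example covers only the masculine forms listed in the tables, uses the stems `katab` and `ktub`, and writes long vowels as ā and ū; the table and function names are illustrative.

```python
# Perfect: a suffix is attached to the perfect stem.
PERFECT_SUFFIX = {
    ("1", "sg"): "tu", ("1", "pl"): "na",
    ("2", "sg"): "ta", ("2", "du"): "tuma", ("2", "pl"): "tum",
    ("3", "sg"): "a", ("3", "du"): "ā", ("3", "pl"): "ū",
}

# Imperfect: a (prefix, suffix) pair wraps the imperfect stem.
IMPERFECT_AFFIXES = {
    ("1", "sg"): ("a", "u"), ("1", "pl"): ("na", "u"),
    ("2", "sg"): ("ta", "u"), ("2", "du"): ("ta", "ān"), ("2", "pl"): ("ta", "ūn"),
    ("3", "sg"): ("ya", "u"), ("3", "du"): ("ya", "ān"), ("3", "pl"): ("ya", "ūn"),
}


def conjugate_perfect(stem: str, person: str, number: str) -> str:
    return stem + PERFECT_SUFFIX[(person, number)]


def conjugate_imperfect(stem: str, person: str, number: str) -> str:
    prefix, suffix = IMPERFECT_AFFIXES[(person, number)]
    return prefix + stem + suffix


print(conjugate_perfect("katab", "2", "pl"))
print(conjugate_imperfect("ktub", "3", "pl"))
```

The first-person dual is absent from both tables, so asking for it raises a `KeyError`, mirroring the empty cells in the paradigms.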
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Feminine form and other verb moods not shown 


Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Morphological Ambiguity 

Derivational ambiguity

- قاعدة: basis/principle/rule, military base, Qa'ida/Qaeda/Qaida

Inflectional ambiguity

- تكتب: you write, she writes

- Segmentation ambiguity

• وجد: he found; و+جد: and+grandfather

• للغة: ل+لغة or ل+اللغة; for a language; for the language

Spelling ambiguity

- Optional diacritics

• كاتب: /katib/ writer, /katab/ to correspond

- Suboptimal spelling

• Hamza dropping: أ, إ -> ا

• Undotted ta-marbuta: ة -> ه

• Undotted final ya: ي -> ى
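Segmentation ambiguity arises because single-letter conjunctions and prepositions look like ordinary word-initial letters. The sketch below enumerates candidate splits for a token in Buckwalter romanization, assuming a small illustrative set of proclitics (`w`, `f`, `b`, `l`, `k`) and a minimum remaining stem length of two letters. It deliberately overgenerates; choosing among the candidates needs a lexicon or a statistical model, which is not shown.

```python
PROCLITICS = ("w", "f", "b", "l", "k")  # wa-, fa-, bi-, li-, ka-


def candidate_segmentations(token: str, min_stem: int = 2):
    """List the token itself plus every split of leading single-letter proclitics."""
    results = [token]
    if len(token) > min_stem and token[0] in PROCLITICS:
        for rest in candidate_segmentations(token[1:], min_stem):
            results.append(token[0] + "+" + rest)
    return results


print(candidate_segmentations("wjd"))
```

For `wjd` the output contains both the unsegmented verb reading and the `w+jd` split, matching the "he found" versus "and+grandfather" example.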



Morphological Ambiguity 

• Multiple sources of ambiguity: one undiacritized spelling, بين

- /bayyana/ Verb he declared/demonstrated

- /bayyanna/ Verb they [feminine] declared/demonstrated

- /bayyin/ Adj clear/evident/explicit

- /bayna/ Prep between/among

- /biyin/ Proper Noun in Yen

- /biyn/ Proper Noun Ben

• Hard to measure specific causes of ambiguity

- Derivational ambiguity* (diacritized tokens)

• 1.09 entries/token

• 1.01 entries/token (within same part-of-speech)

- Spelling ambiguity* (undiacritized tokens)

• 1.28 entries/token

• 1.08 entries/token (within same part-of-speech)

* in Buckwalter’s Lexicon (~40,000 lexemes)



Morphological Ambiguity 


Average overall ambiguity* is 2.5 analyses/word

• Compare to English ENGTWOL ambiguity (1.7-2.2 analyses/word)

[Chart: percentage of words by number of analyses per word]

* In Arabic Penn Treebank 1


Road Map 


• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 


Arabic Computational Morphology 

• Representation units

• Natural token: وللمكتبــات

-White space separated strings (as is)

-Can include extra characters (e.g. tatweel/kashida)

• Word: وللمكتبات

• Segmented word

-Can include any degree of morphological analysis
-Pure segmentation: و ل لمكتبات

-Arabic Treebank tokens (with recovery of some deleted/modified letters): و ل المكتبات
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Arabic Computational Morphology 

• Representation units (continued) 

• Prefix + Stem + Suffix

- ولل + مكتب + ات

-Can create more ambiguity

• Lexeme + Features

- مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features

- ك ت ب + ma12a3a + [+Plural +Def +ل +و]
-Very abstract

• Root + Pattern + Vocalism + Features

- ك ت ب + [pattern without vowels] + a.a.a + [+Plural +Def +ل +و]
-Very very abstract
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Arabic Computational Morphology 


Approaches 

- Finite state machines (Beesley, 2001) (Kiraz, 2001) (Habash et al, 2005b)

- Concatenative analysis/generation (Buckwalter, 2002) (Cavalli-Sforza et al, 2000)

- Lexeme+Feature analysis/generation (Habash, 2004)

- Shallow stemming (Darwish, 2002) (Aljlayl and Frieder 2002)

- Machine learning (Diab et al, 2004) (Lee et al, 2003) (Rogati et al, 2003) (Habash & Rambow 2005a)

Issues 

- Appropriateness of system representation for an application 

• Machine Translation vs. Information Retrieval 

• Arabic spelling vs. phonetic spelling 

- System coverage 

- System extendibility 

- Availability to researchers 

- Use for analysis and generation 



Road Map 
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Orthography 

Morphology 
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- Morphology and Syntax 
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- Phrase Structure 

- Computational Resources 
Machine Translation Issues 
Dialects 


Morphology and Syntax 

• Rich morphology crosses into syntax 

- Pro-drop / Subject conjugation 

- Verb subcategorization and object clitics 

• Verb[transitive]+subject+object

• Verb[intransitive]+subject but not Verb[intransitive]+subject+object

• Verb[passive]+subject but not Verb[passive]+subject+object

• Morphological interactions with syntax 

- Agreement 

• Full: e.g. Noun-Adjective on number, gender, and definiteness 

• Partial: e.g. Verb-Subject on gender (in VSO order) 

- Definiteness 

• Noun compound formation, copular sentences, etc. 

• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc. 
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Morphology and Syntax 

Morphological interactions with syntax (continued) 

- Case 

• MSA marks case: nominative, accusative, genitive

• Almost-free word order

• Case is often marked with optionally written short vowels

- This effectively limits the word-order freedom in published text

Agglutination

- Attached prepositions create words that cross phrase boundaries

ل+المكتبات li+Almaktabat

for the-libraries [PP li [NP Almaktabat]]

Some morphological analysis (minimally, segmentation)
is necessary even for statistical approaches to parsing
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Orthography 

Morphology 
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- Morphology and Syntax 
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- Phrase Structure 

- Computational Resources 
Machine Translation Issues 
Dialects 


Sentence Structure 


Two types of Arabic Sentences 

• Verbal sentences 

- [Verb Subject Object] (VSO) 

Wrote the-boys the-poems 
The boys wrote the poems 

• Copular sentences 

- [Topic Complement] 

the-boys poets 
The boys are poets 
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Sentence Structure 


Verbal sentences 

- Verb agreement with gender only

• كتب الولد/الأولاد wrote[3MascSing] the-boy/the-boys
• كتبت البنت/البنات wrote[3FemSing] the-girl/the-girls

- Pronominal subjects are conjugated

• كتبتَ wrote-you[MascSing]

• كتبتم wrote-you[MascPlur]

• كتبوا wrote-they[MascPlur]

- Passive verbs

• Same structure: Verb[passive] Subject[underlying object]

• Agreement with surface subject


Sentence Structure 

• Verbal sentences 

- Common structural ambiguity 

• Third masculine/feminine singular are structurally ambiguous

- Verb[3MascSing] Noun[Masc] can be read as either
Verb subject=he object=Noun
Verb subject=Noun

• Passive and active forms are often similar in 
standard orthography 

- /kataba/ he wrote 

- /kutiba/ it was written 
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Sentence Structure 


Copular sentences 

- [Topic Complement] 

Definite Topic, Indefinite Complement 

• الولد شاعر

the-boy poet
The boy is a poet

- [Auxiliary Topic Complement]

Auxiliaries (kana and her sisters)

• Tense, Negation, Transformation, Persistence

• كان الولد شاعرا: was the-boy poet: The boy was a poet

• ليس الولد شاعرا: is-not the-boy poet: The boy is not a poet

- Inverted order is expected in certain cases

• Indefinite topic

عندي كتاب /ʕindi kitabun/: at-me a-book: I have a book


Sentence Structure 


• Copular sentences 

- Types of complements 

• Noun/Adjective/Adverb

- الولد ذكي: the-boy smart: The boy is smart

• Prepositional Phrase

- الولد في المكتبة: the-boy in the-library: The boy is in the library

• Copular-Sentence

- الولد كتابه كبير: [the-boy [book-his big]]: The boy, his book is big

• Verb-Sentence

- الأولاد كتبوا الأشعار

[the-boys [wrote-they poems]] The boys wrote the poems

- Full agreement in this order (SVO)

- الأشعار كتبها الأولاد

[the-poems [wrote-it the boys]] The poems, the boys wrote
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Road Map 


Introduction 

Orthography 

Morphology 
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- Morphology and Syntax 

- Sentence Structure 

- Phrase Structure 

- Computational Resources 
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Dialects 


Phrase Structure 


Noun Phrase 

- Determiner Noun Adjective PostModifier

• هذا الكاتب الطموح القادم من اليابان

this the-writer the-ambitious the-arriving from Japan
This ambitious writer from Japan

- Noun-Adjective agreement

• number, gender, definiteness

- الكاتبة الطموحة: the-writer[fem] the-ambitious[fem]

- الكاتبات الطموحات: the-writer[femPlur] the-ambitious[femPlur]



Phrase Structure 


• Noun Phrase 

- Idafa construction (إضافة)

• Noun1 of Noun2 encoded structurally

• Noun1-indefinite Noun2-definite

ملك الأردن

king Jordan

the king of Jordan / Jordan’s king

- Noun1 becomes definite

• Agrees with definite adjectives

- Idafa chains

• N1[indef] N2[indef] ... Nn-1[indef] Nn[def]

ابن عم جار رئيس مجلس إدارة الشركة

son uncle neighbor chief committee management the-company

The cousin of the CEO’s neighbor
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Phrase Structure 


• Morphological definiteness interacts with syntactic structure 




Word 1 = كاتب (writer), Word 2 = فنان (artist)

                     Word 1 definite            Word 1 indefinite
Word 2 definite      Noun Phrase                Noun Compound
                     الكاتب الفنان              كاتب الفنان
                     The artist(ic) writer      The writer of the artist
Word 2 indefinite    Copular Sentence           Noun Phrase
                     الكاتب فنان                كاتب فنان
                     The writer is an artist    An artist(ic) writer
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Computational Resources 

Monolingual corpora for building language models 

- Arabic Gigaword 

• Agence France Presse 

• AlHayat News Agency 

• AnNahar News Agency 

• Xinhua News Agency 

- Arabic Newswire 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

Distributors 

- Linguistic Data Consortium (LDC) 

- Evaluations and Language resources Distribution Agency 
(ELDA) 



Computational Resources 

• Penn Arabic Treebank (PATB) 

- Started in 2001 

- Goal is 1 Million words 

- Currently 650K words 

• Agence France Presse , AlHayat newspaper, AnNahar 
newspaper 

• POS tags 

- Buckwalter analyzer 

- Arabic-tailored POS list 

• PATB constituency 
representation 

- Some modifications of Penn English Treebank 

• (e.g. Verb-phrase internal subjects) 
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Computational Resources 






Prague Dependency Treebank 


• Currently 100k words

• Partial overlap with PATB and Arabic Gigaword

- Agence France Presse, AlHayat and Xinhua

• Morphological analysis

- Similar to PATB

• Dependency representation

[Figure: dependency tree with root Pred ya+SoEad "(he) gets on", dependents Sb -huwa "he" and Obj Al+bAS "the bus", and AuxY wa- "and"]
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Graphic courtesy of Otakar Smrz: http://ckl.mff, 


icl-trees.ppt 


Computational Resources 

• Applications using Penn Arabic Treebank 

- Statistical parsing

• Bikel’s parser (Bikel 2003) 

- Same engine used with English, Chinese and Arabic 

- POS tagging and morphological disambiguation 

• (Diab et al, 2004) and (Habash and Rambow, 2005a) 

• Arabic POS tagging (Khoja, 2001)

• Formalism conversion 

- Constituency to dependency (Zabokrtsky and Smrz 2003) 

- Tree-adjoining grammar extraction (Habash and Rambow 
2004) 

• Automatic diacritization 
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Road Map 


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

- Morphology and Translation 
-Translation Divergences 

- Computational Resources 

• Dialects 


Morphology and Translation 
which level to go down to? 

• Natural token: وللمكتبــات

• Word: وللمكتبات

• Segmented Word: و ل المكتبات

• Prefix + Stem + Suffix: ولل + مكتب + ات

• Lexeme + Features: مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features

ك ت ب + ma12a3a + [+Plural +Def +ل +و]
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Morphology and Translation 
What approach? 

• Natural token Not Appropriate 

• Word Statistical MT 

• Segmented Word Statistical MT 

• Prefix + Stem + Suffix Statistical/Symbolic 

• Lexeme + Features Symbolic MT 

• Root + Pattern + Features Too Abstract? 
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Morphology and Translation 

What resources? 

• Available resources may span different levels of 
representation! 

• Most dictionaries are lexeme-based 

• Buckwalter stem dictionary contains English glosses 

• Statistical translation lexicons depend on the type of 
tokenization used before alignment 

- Word (no disambiguation necessary) 

- Segmented word (minimal disambiguation necessary) 

- Stem/Lexeme (machine/human disambiguation necessary) 

• Consistency is important 
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Road Map 


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

- Morphology and Translation 
-Translation Divergences 

- Computational Resources 

• Dialects 


Translation Divergences 

• Beyond word-order variation 

- Arabic VSO - English SVO 

- Arabic N Adj - English Adj N 

• Meaning of two translationally equivalent constituents is 
distributed differently in two languages 

• Divergence dimensions 

- Categorial Variation (develop -> development) 

- Conflation (become frozen -> freeze)

- Inflation (freeze -> become frozen) 

- Structural (enter the room -> enter into the room) 

- Head Swap (swim across the river -> cross the river swimming) 

- Thematic (John likes Mary -> Mary pleases John) 
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Translation Divergences 

conflation 




I have a book 


at-me book 
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Translation Divergences 
conflation 




لست هنا

I am not here

I-am-not here
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Translation Divergences 
structural 



كتاب نزار

book Nizar

Nizar’s book
Book of Nizar
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Translation Divergences 
structural 




عثرت على الكتاب

I found the book

found-I upon the-book
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Translation Divergences 

thematic & conflational 




رأسي يؤلمني

head-my hurts-me

my head hurts / I have a headache


Translation Divergences 

head swap and categorial 



أسرعت عبور النهر سباحة

I swam across the river quickly

I-sped crossing the-river swimming
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Road Map 


• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

- Morphology and Translation 
-Translation Divergences 

- Computational Resources 

• Dialects 


Computational Resources 

Dictionaries 

- Buckwalter stem dictionary (LDC) 

- Salmoné dictionary (Tufts University)

- Online dictionaries: Ajeeb.com (Sakhr), Almisbar.com, Ectaco.com

Parallel corpora (LDC) 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

- Arabic News Translation Corpus 

- Arabic Treebank English Translation 

- More on LDC webpage . . . 

MT evaluation 

- Arabic-English Multi-translation Corpus (LDC) 

- NIST’s MT-EVAL 

• Statistical MT systems are the state of the art 


lam jaʃtari nizar tawilatan ʒadīdatan 

didn’t buy Nizar table new 

nizar maʃtaraʃ tarabeza gidīda 
nizar maʃtaraʃ tawile ʒdīde 
nizar maʃraʃ mida ʒdīda 

Nizar not-bought-not table new 

General Definitions 


What is a ‘dialect’? 

- Political and Religious factors 

Modern Standard Arabic 

Regional Dialects 

- Egyptian Arabic (EGY) 

- Levantine Arabic (LEV) 

- Gulf Arabic (GULF) 

- North African Arabic (NOR) 

- Iraqi, Yemenite, Sudanese, Maltese? 

Social dialects 

- City 

- Peasant 

- Bedouin 



General Definitions 


Diglossia 
Badawi’s levels 


- Traditional Arabic 

- Modern Arabic 

- Educated Colloquial 

- Literate Colloquial 


- Illiterate Colloquial 

Polyglossia 




[Figure legend: Classical, Dialect, Foreign] 

Phonological Variation 


MSA 

[Chart: the Arabic letters with their MSA phonemic values] 

LEV 

[Chart: the Arabic letters with their LEV phonemic values, including the vowels e and o] 


No dialect-specific standard orthography 

Lexical Variation 


• Arabic Dialects vary widely lexically 

English       MSA         Moroccan    Egyptian    Syrian    Iraqi
table         Tawila      mida        Tarabeza    Tawle     mez
cat           qiTTa       qeTTa       'oTTa       bisse     bazzuna
of            idafa       dyal        bita3       taba3     mal
(I) want      uridu       bgit        3awez       biddi     'arid
there is      yujadu      kayn        fi          fi        aku
there isn't   la yujadu   ma kaynš    mafiš       ma fi     maku

• Arabic orthography allows consolidating some 
variations 
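
The table above can be treated as a small multi-dialect lexicon. The sketch below stores two of its rows in a dictionary and looks up which varieties use a given form; the names `LEXICON` and `varieties_using` are hypothetical, and the Iraqi form `'arid` assumes the leading mark in the table is a glottal stop.

```python
# Hypothetical mini-lexicon holding two rows of the lexical-variation table.
LEXICON = {
    "table": {"MSA": "Tawila", "Moroccan": "mida", "Egyptian": "Tarabeza",
              "Syrian": "Tawle", "Iraqi": "mez"},
    "(I) want": {"MSA": "uridu", "Moroccan": "bgit", "Egyptian": "3awez",
                 "Syrian": "biddi", "Iraqi": "'arid"},
}


def varieties_using(form):
    """Return the sorted list of varieties that have `form` as an entry."""
    return sorted({variety
                   for row in LEXICON.values()
                   for variety, word in row.items()
                   if word == form})


print(LEXICON["table"]["Iraqi"])   # mez
print(varieties_using("biddi"))    # ['Syrian']
```

In these two rows no form is shared between any two varieties, which is the point of the slide: a word-level lookup keyed on MSA forms alone would miss all of the dialect entries.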

Morphological Variation 

Nouns 

- No case marking 

• Word order implications 

- Paradigm reduction 

• Consolidating masculine & feminine plural 

Verbs 

- Paradigm reduction 

• Loss of dual forms 

• Consolidating masculine & feminine plural (2nd, 3rd person) 

• Loss of morphological moods 

- Subjunctive/jussive form dominates in some dialects 

- Indicative form dominates in others 

- Other aspects increase in complexity 



Morphological Variation 

Verb Morphology 



MSA 

ولم تكتبوها له 

walam taktubuha lahu 
wa+lam taktubu+ha la+hu 

and+not_past write_you+it for+him 


EGY 

wimakatabtuhaluš 

wi+ma+katab+tu+ha+lu+š 

and+not+wrote+you+it+for_him+not 


And you didn’t write it for him 
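
The two segmentations above can be paired with their glosses mechanically, which makes the contrast easy to count: MSA spreads the morphemes over three words, while EGY packs them into one. This is a minimal sketch; the function name `align_gloss` is hypothetical, and the final EGY negation suffix is written here as `š`, which is an assumption about the intended transliteration.

```python
def align_gloss(segmented, gloss):
    """Pair each '+'-separated morpheme with its gloss, word by word."""
    pairs = []
    for word, word_gloss in zip(segmented.split(), gloss.split()):
        morphemes = word.split("+")
        glosses = word_gloss.split("+")
        if len(morphemes) != len(glosses):
            raise ValueError(f"cannot align {word!r} with {word_gloss!r}")
        pairs.extend(zip(morphemes, glosses))
    return pairs


msa = align_gloss("wa+lam taktubu+ha la+hu",
                  "and+not_past write_you+it for+him")
egy = align_gloss("wi+ma+katab+tu+ha+lu+š",
                  "and+not+wrote+you+it+for_him+not")

print(len(msa), "morphemes in 3 words")  # 6 morphemes in 3 words
print(len(egy), "morphemes in 1 word")   # 7 morphemes in 1 word
print(egy[1], egy[-1])                   # ('ma', 'not') ('š', 'not')
```

The last line shows the two halves of the EGY negation, which surround the verb and its clitics, where MSA uses the single particle lam.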

Morphological Variation 

Verb conjugation 


• Perfect verb derivation (suffixes only) 

        1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.)
MSA     katabtu               katabta                       katabti
LEV     katabt                katabt                        katabti


• Imperfect verb derivation (prefix + suffix) 

        1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.)
MSA     aktubu                taktubu                       taktubīna (taktubī)
LEV     aktob                 toktob                        toktobi
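
The perfect-verb paradigm above is suffix-only, so it can be sketched as a stem plus a per-variety suffix table. The names `PERFECT_SUFFIXES`, `conjugate_perfect` and `distinct_forms`, and the slot labels, are hypothetical; the suffixes are read off the forms katabtu / katabta / katabti (MSA) and katabt / katabti (LEV).

```python
# Suffixes for the three singular slots shown on the slide.
PERFECT_SUFFIXES = {
    "MSA": {"1sg": "tu", "2sg.masc": "ta", "2sg.fem": "ti"},
    "LEV": {"1sg": "t", "2sg.masc": "t", "2sg.fem": "ti"},
}


def conjugate_perfect(stem, variety, slot):
    """Attach the perfect suffix for `slot` in `variety` to `stem`."""
    return stem + PERFECT_SUFFIXES[variety][slot]


def distinct_forms(stem, variety):
    """Return the set of distinct surface forms across the three slots."""
    return {conjugate_perfect(stem, variety, slot)
            for slot in PERFECT_SUFFIXES[variety]}


for variety in PERFECT_SUFFIXES:
    print(variety, sorted(distinct_forms("katab", variety)))
# MSA ['katabta', 'katabti', 'katabtu']
# LEV ['katabt', 'katabti']
```

The LEV row yields two forms for three slots: the loss of final short vowels makes the 1st person and the 2nd person masculine identical, so a LEV analyzer has to return both readings for katabt.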


Morphological Variation 


Tense expression 



        Perfect            Imperfect

MSA     kataba   Past      jaktubu       Present
                           sajaktubu     Future

LEV     katab    Past      jiktob        0-Tense
                           bjoktob       Present habitual
                           ʕam bjoktob   Present progressive
                           ħajiktob      Future

Syntactic Variation 

• Verbal sentences 

- The children wrote poems 

- MSA 

• Verb Subject Object (Partial agreement) 

wrote(masc) the-boys the-poems 

• Subject Verb Object (Full agreement) 

الأولاد كتبوا الأشعار 

the-boys wrote(mascPlural) the-poems 

- LEV, EGY 

• Subject Verb Object 

the-boys wrote(mascPlural) the-poems 

• Less common: Verb Subject Object 

كتبوا الأولاد الأشعار 

wrote(mascPlural) the-boys the-poems 

• Full agreement in both orders 
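
The agreement pattern above reduces to one rule: only MSA verb-initial sentences show partial agreement. The sketch below encodes just that rule and prints the corresponding glosses; the function names `verb_agreement` and `gloss` and the gloss notation are assumptions for this illustration.

```python
def verb_agreement(variety, order):
    """Return 'partial' or 'full' subject-verb agreement.

    Encodes only the pattern on the slide: MSA VSO sentences show partial
    agreement; MSA SVO and both dialect orders show full agreement.
    """
    if order not in ("VSO", "SVO"):
        raise ValueError(f"unknown word order: {order}")
    if variety == "MSA" and order == "VSO":
        return "partial"
    return "full"


def gloss(variety, order, subject="the-boys", obj="the-poems"):
    """Build a word-for-word gloss with the verb marked for agreement."""
    if verb_agreement(variety, order) == "partial":
        verb = "wrote(masc)"
    else:
        verb = "wrote(mascPlural)"
    words = [verb, subject, obj] if order == "VSO" else [subject, verb, obj]
    return " ".join(words)


print(gloss("MSA", "VSO"))  # wrote(masc) the-boys the-poems
print(gloss("MSA", "SVO"))  # the-boys wrote(mascPlural) the-poems
print(gloss("LEV", "VSO"))  # wrote(mascPlural) the-boys the-poems
```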

Syntactic Variation 

Noun Phrase 

- Idafa construction 

• Noun1 of Noun2 encoded structurally 

king Jordan 

the king of Jordan / Jordan’s king 

- Dialects have an additional common construct 

• Noun1 <particle> Noun2 

• LEV: الملك تبع الأردن   the-king belonging-to Jordan 

• <particle> differs widely among dialects 

- Pre/post-modifying demonstrative article 

• MSA: هذا الرجل   this the-man   'this man' 

• EGY: الراجل ده   the-man this   'this man' 
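
The two possessive constructions above can be contrasted at the gloss level. In this sketch the particles come from the "of" row of the lexical-variation table (Moroccan dyal, Syrian taba3, Iraqi mal); the names `PARTICLES`, `idafa` and `particle_construct` are hypothetical, and the output is a word-for-word gloss rather than real dialect morphology.

```python
# Possessive particles from the "of" row of the lexical-variation table.
PARTICLES = {"Moroccan": "dyal", "Syrian": "taba3", "Iraqi": "mal"}


def idafa(noun1, noun2):
    """Idafa: the relation is encoded structurally, by juxtaposition."""
    return f"{noun1} {noun2}"


def particle_construct(noun1, noun2, variety):
    """Dialectal alternative: Noun1 <particle> Noun2."""
    return f"the-{noun1} {PARTICLES[variety]} {noun2}"


print(idafa("king", "Jordan"))                         # king Jordan
print(particle_construct("king", "Jordan", "Syrian"))  # the-king taba3 Jordan
print(particle_construct("king", "Jordan", "Iraqi"))   # the-king mal Jordan
```

Both outputs correspond to the same English phrase, "the king of Jordan", so a system translating from dialects needs to recognize each variety's particle as well as the idafa pattern.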

Code Switching 


MSA and Dialect mixing in speech 

• phonology, morphology and syntax 

[Example: Arabic speech transcript in which MSA and LEV segments alternate; legend: MSA, LEV] 

Computational Resources 

• Most work on Arabic dialects focuses on Automatic 
Speech Recognition 

• Speech/transcript corpora 

- Egyptian and Levantine Arabic (LDC) 

- Moroccan and Tunisian Arabic (ELDA) 

- Gulf Arabic (Appen) 

- Many others... 

• Few lexicons/morphology resources 

- CallHome Egyptian Arabic monolingual lexicon (LDC) 

- CallHome Egyptian Verb transducer (LDC) 

• Work on multi-dialectal resources 

- Linguistic Data Consortium 

- Columbia University Arabic Dialect Project 

• Pan-Arab lexicon and Pan-Arab Morphology 

• Parsing Arabic Dialects (JHU summer workshop 2005) 



Resources 


Distributors 

• Linguistic Data Consortium 

• NEMLAR (Network for Euro-Mediterranean LAnguage Resources) 

• ELSNET (European Network of Excellence in Human Language Technologies) 

• ELDA (Evaluation and Language resources Distribution Agency) 


Resources 


Reports 

• Mohamed Maamouri and Christopher Cieri. 2002. 
Resources for Natural Language Processing at the 
Linguistic Data Consortium. In Proceedings of the 
International Symposium on Processing of Arabic, pages 
125-146, Manouba, Tunisia, April 2002. 

• Mahtab Nikkhou and Khalid Choukri. Survey on Arabic 
Language Resources and Tools in the Mediterranean 
Countries. 

• Arabic Information Retrieval and Computational 
Linguistics Resources (thanks to Doug Oard) 

Resources 


Monolingual Corpora 

• Arabic Gigaword 

• Arabic Newswire 

Parallel Corpora 

• United Nations Parallel Corpus 

• Ummah Parallel Corpus 

• Arabic News Translation 

• Multiple-Translation Arabic 

Treebanks 

• Arabic Penn Treebank Webpage 

- Part 1 v 2.0, Part 2 v 2.0, Part 3 v 1.0, 10K-word English Translation 

• Prague Arabic Dependency Treebank 

Resources 


Morphology 

• Buckwalter Arabic Morphological Analyzer 

- Version 1.0, Version 2.0 

• Xerox Arabic Morphology (online) 

Dialect Resources 

• CALLHOME Egyptian Arabic Transcripts 

• CALLHOME Egyptian Arabic Speech 

• Egyptian Colloquial Arabic Lexicon 

• Levantine Arabic Resources 

• http://www.orientel.org/ 

• http://www.appen.com.au 


Resources 


Dictionaries 

• Buckwalter Stem Dictionary 

• H. Anthony Salmone. An Advanced Learner's Arabic-English 
Dictionary, encoded by the Perseus Project, Tufts 
University (contact: David Smith dasmith@perseus.tufts.edu) 

• Ajeeb Arabic-English Dictionary (online) 

• Al-Misbar Dictionary (online) 

• Ectaco Bilingual Dictionary (online) 

Online MT systems 

• Ajeeb's Arabic-English Machine Translation (online) 

• Al-Misbar English-Arabic Machine Translation (online) 

Conferences and Workshops 

with some focus on Arabic 


ACL 2005 Workshop on Computational Approaches to Semitic Languages 

Arabic Language Resources and Tools Conference 2004, Cairo, Egypt 

Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004) 

Traitement Automatique du Langage Naturel (TALN '04) 

NIST MT EVAL ( http://www.nist.gov/speech/tests/mt/ ) 

MT Summit IX Workshop on Machine Translation for Semitic Languages in 2003 

LREC 2002 Arabic Language Resources and Evaluation Workshop 

ACL 2002 Workshop on Computational Approaches to Semitic Languages 

International Symposium on Processing of Arabic 2002, Tunisia 

Workshop on Arabic Language Processing: Status and Prospects (ACL/EACL 2001) 

Arabic Translation and Localisation Symposium (ATLAS 1999) 

Computational Approaches to Semitic Languages (COLING/ACL 1998) 